Introduction to Open Data Science is a course organized by the Doctoral School in the Humanities and Social Sciences (HYMY), University of Helsinki. As mentioned on the course page, “The name of this course refers to THREE BIG THEMES: 1) Open Data, 2) Open Science, and 3) Data Science”. The learning platform for this course is the MOOC (Massive Open Online Course) platform of the University of Helsinki: http://mooc.helsinki.fi. More information about the course can be found at this link: https://courses.helsinki.fi/en/hymy005/120776718.
In the first week of the course, I went through the general instructions and familiarized myself with the learning tools, the course content and the schedule. Though I had previous experience with R and RStudio, I completed all the exercises and instructions in “DataCamp: R Short and Sweet” to refresh my R skills; the statement of accomplishment is published here. The other exercise I completed in the first week is “RStudio Exercise 1”. I already had a GitHub account, so, following the exercise instructions, I forked the IODS-project repository from Tuomo Nieminen’s GitHub. After forking, I cloned the IODS-project repository to my local machine (my computer) using the command git clone. After cloning the repository to my computer, I used RStudio to edit the existing R Markdown files in the repository (chapter1.Rmd, chapter2.Rmd and index.Rmd) and also added a new R Markdown file saved as README.Rmd. Next, I committed the changes made on the local machine using the git command git commit -m and finally pushed them to GitHub using git push. I also changed the theme of my course diary web page to the Merlot theme. In these exercises, I refreshed my Git and GitHub knowledge and familiarized myself with how I will use GitHub during this course to share and publish my learning diary.
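The Git workflow described above can be summarized as a short command-line sketch (the repository URL, user name and commit message below are illustrative placeholders, not the exact ones I used):

```shell
# Clone the forked repository to the local machine (user name is a placeholder)
git clone https://github.com/<username>/IODS-project.git
cd IODS-project

# ...edit chapter1.Rmd, chapter2.Rmd and index.Rmd, and add README.Rmd in RStudio...

# Stage and commit the local changes, then push them back to GitHub
git add .
git commit -m "Update course diary"
git push
```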
In this section, the data set given in this link has been preprocessed for further/downstream analysis. After creating a folder named ‘data’ in my IODS-project repository, I created a new R script named ‘create_learning2014’ in RStudio, wrote all the steps of the data wrangling process in it, and saved it in the ‘data’ folder. All the steps of the data wrangling can be found here and/or in the code shown below.
The data for this section are extracted from a survey on teaching and learning approaches conducted by Kimmo Vehkalahti. The research project was made possible by Teachers’ Academy funding (2013-2015). The data were collected between 3.12.2014 and 10.1.2015. The survey includes 183 respondents, and the questionnaire contains 60 variables/questions in total.
students2014=read.table("data/learning2014.txt", sep="\t", header=TRUE)
dim(students2014)
## [1] 166 7
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
summary(students2014)
## gender Age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf Points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
After preprocessing the raw data, the final data set has 7 columns (gender, age, attitude, deep, stra, surf and points) and 166 individuals, as shown in the output above.
plot_students2014=ggpairs(students2014, mapping = aes(col = gender,alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
plot_students2014

The above plot shows that there are 110 female and 56 male respondents in the survey. Moreover, the plot shows how the variables (Points, Age, attitude, deep, stra and surf) are correlated. Based on the absolute correlation values, attitude and Points are the most highly correlated pair, while deep and Points are the least correlated.
The explanatory variables were chosen based on their (absolute) correlation with Points; the top three are attitude (0.437), stra (0.146) and surf (-0.144). Using these variables, the model is built with the R function “lm”.
my_model1 <- lm(Points ~ attitude + stra + surf, data = students2014)
summary(my_model1)
##
## Call:
## lm(formula = Points ~ attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The significant parameters in the summary table above are the intercept (0.00322) and attitude (1.93e-08). Hence stra and surf are not considered in the model below. Using the “lm” function again, a linear regression model is built between Points and attitude; the model after removing the insignificant variables is summarized below. Regarding the adjusted R-squared value, we see that it changed from 0.1927 (old model) to 0.1856 (new model). However, the F-statistic (from 14.13 to 38.61) and its p-value (from 3.156e-08 to 4.119e-09) improved considerably. Thus, the R-squared value alone does not determine the quality of the model, and the fairly low R-squared values here might be due to outliers in the data.
my_model2 <- lm(Points ~ attitude , data = students2014)
summary(my_model2)
##
## Call:
## lm(formula = Points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
par(mfrow = c(2,2))
plot(my_model2, which=c(1,2,5))

The three diagnostic plots for model validation are shown above. The assumptions checked are: normality of the errors (Normal Q-Q plot), constant error variance independent of the fitted values (Residuals vs Fitted), and the influence of individual observations (Residuals vs Leverage).
Based on the above plots, we can conclude that the errors are roughly normally distributed (clearly observed in the Q-Q plot). Similarly, the Residuals vs Fitted plot shows that the size of the errors does not depend on the fitted values. Moreover, in the Residuals vs Leverage plot, even the two points towards the right have only minor influence on the fit. The model handles the outliers reasonably well; thus, the model assumptions are more or less valid.
library(GGally)
library(ggplot2)
library(ggpubr)
## Loading required package: magrittr
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr)
library(gmodels)
library(boot)

In this section, data wrangling has been done for two data sets retrieved from the UCI Machine Learning Repository. The R script used for data wrangling can be found here.
The data sets used in this analysis were retrieved from the UCI Machine Learning Repository. They describe student achievement in secondary education at two Portuguese schools. The data were collected using school reports and questionnaires that include various attributes (student grades, demographic, social and school-related features). Two data sets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por) (Cortez et al. 2008). For this analysis, the original data sets (mat and por) were modified by joining them together, and only students who answered the questionnaire in both the Mathematics and Portuguese classes were kept. The final data set used in this analysis includes a total of 382 observations and 35 attributes.
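The join step described above can be sketched with base R’s merge(), which by default keeps only rows whose identifier appears in both tables. The toy data frames and column names below are purely illustrative, not the real questionnaire variables:

```r
# Two toy data sets standing in for the Mathematics (mat) and Portuguese (por) answers
mat <- data.frame(id = c("s1", "s2", "s3"), G3.mat = c(10, 12, 15))
por <- data.frame(id = c("s2", "s3", "s4"), G3.por = c(11, 14, 9))

# merge() performs an inner join by default: only students present in BOTH survive
both <- merge(mat, por, by = "id")
nrow(both)  # 2 (only s2 and s3 answered in both classes)
```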
modified_data= read.table("data/modified_data.txt", sep="\t", header=TRUE)
colnames(modified_data)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
In order to study the relationship between high/low alcohol consumption and the other variables, I chose four variables (“absences”, “sex”, “goout” and “studytime”). My hypotheses about how each variable is related to alcohol consumption are shown below:
studytime: Students who dedicate a large amount of time to their studies do not have time to go out drinking (studytime and high/low alcohol consumption are negatively correlated)
sex: Male students have higher alcohol consumption than female students (being male and high/low alcohol consumption are positively correlated)
goout: Students who go out with friends frequently consume more alcohol than students who do not go out (goout and high/low alcohol consumption are positively correlated)
absences: Students who consume more alcohol miss classes for various reasons (e.g. hangover, attending a party during class time) (absences and high/low alcohol consumption are positively correlated)
A bar plot demonstrating the distributions of my chosen variables:
my_variables= c("absences", "sex", "goout", "studytime", "high_use")
my_variables_data <- select(modified_data, one_of(my_variables))
colnames(my_variables_data)
## [1] "absences" "sex" "goout" "studytime" "high_use"
gather(my_variables_data) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
A box plot demonstrating my chosen variables and their relationships with alcohol consumption:
g1 <- ggplot(modified_data, aes(x = high_use, y = absences,col=sex))
p1=g1 + geom_boxplot() + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")
g2 <- ggplot(modified_data, aes(x = high_use, y = goout, col=sex))
p2=g2 + geom_boxplot() + ylab("going out with friends") + ggtitle("Student who going out with friends by alcohol consumption and sex")
g3 <- ggplot(modified_data, aes(x = high_use, y = studytime, col=sex))
p3=g3 + geom_boxplot() + ylab("studytime - weekly study time") + ggtitle("Student who dedicate time to study by alcohol consumption and sex")
ggarrange(p1, p2, p3, labels = c("A", "B", "C"), ncol = 2, nrow = 2)

#sex_high_use <- CrossTable(my_variables_data$sex, my_variables_data$high_use)
#goout_high_use <- CrossTable(my_variables_data$goout, my_variables_data$high_use)
#studytime_high_use <- CrossTable(my_variables_data$studytime, my_variables_data$high_use)
To statistically explore the relationship between my chosen variables and the high/low alcohol consumption variable, a logistic regression model was built using the R function glm().
m <- glm(high_use ~ absences + sex + goout + studytime , data = modified_data, family = "binomial")
In order to be able to interpret the fitted model in detail, the summary, coefficients and confidence intervals of the fitted model are calculated as shown below.
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + sex + goout + studytime,
## family = "binomial", data = modified_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9103 -0.7892 -0.5021 0.7574 2.6021
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.20483 0.59429 -5.393 6.94e-08 ***
## absences 0.07811 0.02244 3.481 0.000499 ***
## sexM 0.78173 0.26555 2.944 0.003242 **
## goout 0.72677 0.11994 6.059 1.37e-09 ***
## studytime -0.42116 0.17058 -2.469 0.013551 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 381.31 on 377 degrees of freedom
## AIC: 391.31
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) absences sexM goout studytime
## -3.20483157 0.07810746 0.78173290 0.72676824 -0.42116184
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR,CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.04056573 0.01222314 0.1263579
## absences 1.08123884 1.03577383 1.1324143
## sexM 2.18525582 1.30370562 3.7010763
## goout 2.06838526 1.64470314 2.6349716
## studytime 0.65628388 0.46510383 0.9099505
As shown above, all four variables are statistically significant. goout has the lowest p-value, suggesting a strong association between going out with friends and high/low alcohol consumption. In the logit model, the response variable is the log odds: \(ln(odds) = ln(p/(1-p)) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n\). The positive coefficient for the goout predictor suggests that, all other variables being equal, going out with friends increases the odds of high alcohol consumption: a unit increase in goout increases the log odds by 0.72677. The variables absences and sex also have low p-values, 0.000499 and 0.003242, respectively. The positive coefficient for absences suggests that, all else being equal, a unit increase in absences increases the log odds by 0.07811. The positive coefficient for sexM suggests that male students are more likely to be high alcohol consumers; being male increases the log odds by 0.78173. studytime is also significant, and its negative coefficient suggests that, all else being equal, a unit increase in studytime reduces the log odds by 0.42116.
The odds are defined as the ratio of expected “successes” to “failures”: p/(1-p). Odds are an alternative way of expressing probabilities; an odds ratio OR > 1 corresponds to higher odds of success after the change. The exponents of the coefficients of a logistic regression model can be interpreted as odds ratios between a unit change (vs no change) in the corresponding explanatory variable. The exponent of the goout coefficient is 2.068: a unit increase in goout, all other variables being equal, multiplies the odds of high consumption by about 2.07. Similarly, the exponent of the sexM coefficient is 2.185 and that of absences is 1.081, so being male and each additional absence increase the odds of high alcohol consumption by factors of about 2.19 and 1.08, respectively. The exponent of the studytime coefficient, however, is 0.656, which means a unit increase in studytime decreases the odds of high alcohol consumption by about a third.
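The odds ratios quoted above are simply the exponentiated coefficients; as a quick sketch, using the coefficient estimates copied from the model summary above:

```r
# Log-odds coefficients copied from summary(m) above
coefs <- c(absences = 0.07811, sexM = 0.78173, goout = 0.72677, studytime = -0.42116)

# Exponentiating turns log-odds into odds ratios
round(exp(coefs), 3)  # absences 1.081, sexM 2.185, goout 2.068, studytime 0.656
```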
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'modified_data'
modified_data <- mutate(modified_data, probability = probabilities)
# use the probabilities to make a prediction of high_use
modified_data <- mutate(modified_data, prediction = probability>0.5)
table(high_use = modified_data$high_use, prediction = modified_data$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 252 16
## TRUE 65 49
# the logistic regression model m and dataset alc (with predictions) are available
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(modified_data$high_use, modified_data$probability)
## [1] 0.2120419
The above 2x2 cross tabulation indicates that, out of 382 observations, 81 (65 + 16) were wrongly predicted. The average proportion of incorrect predictions in the training data is therefore about 21%, which is better than the error of the model in the DataCamp exercises.
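As a sanity check of the numbers above, the training error can be recomputed directly from the 2x2 table (values copied from the cross tabulation):

```r
# Confusion matrix copied from the cross tabulation above
conf <- matrix(c(252, 16,
                 65, 49),
               nrow = 2, byrow = TRUE,
               dimnames = list(high_use = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))

wrong <- conf["FALSE", "TRUE"] + conf["TRUE", "FALSE"]  # 16 + 65 = 81
wrong / sum(conf)  # 81 / 382, about 0.212, matching loss_func above
```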
Bonus
# K-fold cross-validation
cv <- cv.glm(data = modified_data, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2094241
The 10-fold cross-validation error is 0.2094241. Using 10-fold cross-validation brings no great improvement: the prediction error is only marginally smaller than the training error computed from the 2x2 cross tabulation above (0.2120419).
# install.packages("corrplot")  # install the corrplot package
#install.packages("kableExtra")
library(GGally)
library(ggplot2)
library(tidyr)
library(dplyr)
library(MASS)##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## corrplot 0.84 loaded
library(ggpubr)
library(magrittr)
library(boot)
library(knitr)
library(kableExtra)
#library("DT")

# load the data
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The default installation of R comes with many (usually small) data sets. The Boston data set we are dealing with in this week’s exercise comes from the MASS package. The data were originally published by (Harrison et al. 1978), containing information about Boston house prices, and were later also published by (Belsley et al. 1980). The Boston data set has 506 observations and 14 variables. Details about the data set can be found via these two links: [1] and [2].
Boston summary table
Boston_summary= do.call(cbind, lapply(Boston, summary))
kable(Boston_summary,"html", caption="Boston summary table") %>% kable_styling(bootstrap_options ="condensed", font_size = 13, full_width = F, position = "left") %>% column_spec(1:1, bold = T,color = "white", background = "green") %>% row_spec(0, bold = T, color = "white", background = "green")

| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. | 0.006320 | 0.00000 | 0.46000 | 0.00000 | 0.3850000 | 3.561000 | 2.9000 | 1.129600 | 1.000000 | 187.0000 | 12.60000 | 0.3200 | 1.73000 | 5.00000 |
| 1st Qu. | 0.082045 | 0.00000 | 5.19000 | 0.00000 | 0.4490000 | 5.885500 | 45.0250 | 2.100175 | 4.000000 | 279.0000 | 17.40000 | 375.3775 | 6.95000 | 17.02500 |
| Median | 0.256510 | 0.00000 | 9.69000 | 0.00000 | 0.5380000 | 6.208500 | 77.5000 | 3.207450 | 5.000000 | 330.0000 | 19.05000 | 391.4400 | 11.36000 | 21.20000 |
| Mean | 3.613524 | 11.36364 | 11.13678 | 0.06917 | 0.5546951 | 6.284634 | 68.5749 | 3.795043 | 9.549407 | 408.2372 | 18.45553 | 356.6740 | 12.65306 | 22.53281 |
| 3rd Qu. | 3.677083 | 12.50000 | 18.10000 | 0.00000 | 0.6240000 | 6.623500 | 94.0750 | 5.188425 | 24.000000 | 666.0000 | 20.20000 | 396.2250 | 16.95500 | 25.00000 |
| Max. | 88.976200 | 100.00000 | 27.74000 | 1.00000 | 0.8710000 | 8.780000 | 100.0000 | 12.126500 | 24.000000 | 711.0000 | 22.00000 | 396.9000 | 37.97000 | 50.00000 |
Visualizing the scatter plot matrix and examining the correlations between the Boston data set variables:
p1=ggpairs(Boston,title="scatter plot matrix with correlation")
p1 + theme(plot.title = element_text(size = rel(2)))

cor_matrix <- cor(Boston) %>% round(digits = 2)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "lower", cl.pos = "b", tl.col="black", tl.pos = "d", tl.cex = 1.2,title="Correlations plot",mar=c(0,0,1,0))
The above scatter plot matrix and correlation plot describe the distributions of the variables and the relationships between them. In the scatter plot matrix, the diagonal cells show histograms of each variable, the lower panel displays a scatter plot for each pair of variables and the upper panel displays the correlation value of each pair. The correlation plot is a colored representation of the Boston correlation matrix, where the values are represented by different colors ranging from red to blue (-1 to 1), with white as the middle value of zero. For example, in the correlation plot we can see that the relation between the index of accessibility to radial highways (rad) and the full-value property-tax rate per $10,000 (tax) is shown as a high-intensity blue circle, and the scatter plot matrix displays a correlation value of 0.91. Hence, as the index of accessibility to radial highways increases, the full-value property-tax rate per $10,000 also increases, and vice versa. Moreover, the correlation plot displays a high-intensity red circle for the variables nitrogen oxides concentration (parts per 10 million) (nox) and weighted mean of distances to five Boston employment centres (dis), with a correlation value of -0.77 in the scatter plot matrix. Hence, as nox increases, dis decreases, and vice versa.
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)

Boston scaled summary table
Boston_summary_scaled= do.call(cbind, lapply(boston_scaled, summary))
kable(Boston_summary_scaled,"html") %>% kable_styling(bootstrap_options ="condensed", font_size = 13, full_width = F, position = "left") %>% column_spec(1:1, bold = T,color = "white", background = "green") %>% row_spec(0, bold = T, color = "white", background = "green")

| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. | -0.4193669 | -0.4872402 | -1.5563017 | -0.2723291 | -1.4644327 | -3.8764132 | -2.3331282 | -1.2658165 | -0.9818712 | -1.3126910 | -2.7047025 | -3.9033305 | -1.5296134 | -1.9063399 |
| 1st Qu. | -0.4105633 | -0.4872402 | -0.8668328 | -0.2723291 | -0.9121262 | -0.5680681 | -0.8366200 | -0.8048913 | -0.6373311 | -0.7668172 | -0.4875567 | 0.2048688 | -0.7986296 | -0.5988631 |
| Median | -0.3902803 | -0.4872402 | -0.2108898 | -0.2723291 | -0.1440749 | -0.1083583 | 0.3170678 | -0.2790473 | -0.5224844 | -0.4642132 | 0.2745872 | 0.3808097 | -0.1810744 | -0.1449159 |
| Mean | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 | 0.0000000 |
| 3rd Qu. | 0.0073892 | 0.0487240 | 1.0149946 | -0.2723291 | 0.5980871 | 0.4822906 | 0.9059016 | 0.6617161 | 1.6596029 | 1.5294129 | 0.8057784 | 0.4332223 | 0.6024226 | 0.2682577 |
| Max. | 9.9241096 | 3.8004735 | 2.4201701 | 3.6647712 | 2.7296452 | 3.5515296 | 1.1163897 | 3.9566022 | 1.6596029 | 1.7964164 | 1.6372081 | 0.4406159 | 3.5452624 | 2.9865046 |
One of the main goals of standardizing data is to be able to compare variables measured on different scales. The Boston data set is scaled using the R function scale(). From the above summary table of the scaled data, we can clearly see that each variable now has a mean of 0.
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
pairs(boston_scaled[1:13], main = "Scaled scatter plot matrix",
pch = 21, bg = c("red", "green3", "blue","yellow")[boston_scaled$crime],
oma=c(4,4,6,12),upper.panel = NULL)
par(xpd=TRUE)
legend(0.85, 0.7, as.vector(unique(boston_scaled$crime)),
fill=c("red", "green3", "blue","yellow"))
In scaling the Boston data, each variable’s mean is subtracted from it and the result is divided by its standard deviation, so every variable ends up with mean 0 and standard deviation 1. Since all the Boston variables are standardized with the formula \(\frac{X-\mu}{\sigma}\), the variable values change, and we can clearly see a big difference between the scaled and the original Boston data sets: comparing the two summary tables, for example, the minimum, maximum and quartile values are not the same. However, the correlations between the variables are unchanged in both data sets.
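A small base-R check of these claims on synthetic data (the toy variables a and b are illustrative): scale() centers each column to mean 0 and standard deviation 1, and the correlation matrix is unchanged.

```r
set.seed(1)
x <- data.frame(a = rnorm(100, mean = 5, sd = 2), b = runif(100, 0, 10))
x_scaled <- as.data.frame(scale(x))

round(colMeans(x_scaled), 10)      # both columns have mean 0
round(apply(x_scaled, 2, sd), 10)  # both columns have sd 1
all.equal(cor(x), cor(x_scaled))   # TRUE: correlations are scale-invariant
```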
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2574257 0.2376238 0.2549505
##
## Group means:
## zn indus chas nox rm
## low 1.0181294 -0.9012680 -0.077423115 -0.8813836 0.4489710
## med_low -0.1128898 -0.2642062 -0.007331936 -0.5461982 -0.1613744
## med_high -0.3791541 0.2102954 0.137785540 0.4432902 0.1252475
## high -0.4872402 1.0170891 -0.119431971 1.1113501 -0.4204577
## age dis rad tax ptratio
## low -0.9252313 0.8892731 -0.6850891 -0.7303942 -0.45279942
## med_low -0.3214673 0.3026019 -0.5589261 -0.4739120 -0.01721261
## med_high 0.4058074 -0.4059987 -0.3896929 -0.2721808 -0.33599395
## high 0.8317035 -0.8663562 1.6384176 1.5142626 0.78111358
## black lstat medv
## low 0.37628946 -0.77110785 0.54514207
## med_low 0.31374782 -0.09407750 -0.01172177
## med_high 0.09925633 -0.03119295 0.19022133
## high -0.74429747 0.93451675 -0.76974319
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12403207 0.74000039 -0.83966506
## indus -0.01027535 -0.19459974 0.24040801
## chas -0.08453630 -0.01094597 0.10835703
## nox 0.43782585 -0.73908512 -1.41398724
## rm -0.07613708 -0.03224923 -0.15778131
## age 0.29782982 -0.30112429 -0.13679802
## dis -0.05591067 -0.20652286 0.02265899
## rad 3.03298093 0.98888906 -0.11246801
## tax -0.06864096 -0.08774866 0.54080492
## ptratio 0.08890535 0.03164932 -0.15445087
## black -0.14261608 0.02852735 0.10694737
## lstat 0.18747565 -0.13614712 0.50633296
## medv 0.12505597 -0.40798302 -0.10264502
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9426 0.0420 0.0154
From the above prior probabilities, we can see the proportion of each group (“low”, “med_low”, “med_high”, “high”) in the total observations; each group makes up approximately 25% of the Boston data set. Moreover, the group means display the mean value of each variable within each group, showing how the means differ between the groups. The coefficients of the linear discriminants are the weights of each variable in the linear combinations that form the discriminants. For example, the first discriminant function is the linear combination \(0.124 \cdot zn - 0.010 \cdot indus - 0.085 \cdot chas + 0.438 \cdot nox + \dots + 0.187 \cdot lstat + 0.125 \cdot medv\). The proportion of trace gives the share of the between-group variance explained: in my analysis, the first discriminant explains about 94% of the between-group variance in the Boston data set.
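To make the “linear combination” reading of the coefficients concrete, here is a small sketch on R’s built-in iris data (not the Boston data): the discriminant scores produced by predict() can be reproduced by centering the data at the prior-weighted mean of the group means and multiplying by the coefficient matrix fit$scaling.

```r
library(MASS)

# Fit LDA and rebuild the discriminant scores by hand
fit <- lda(Species ~ ., data = iris)
ctr <- colSums(fit$prior * fit$means)  # prior-weighted centre of the group means
manual <- scale(as.matrix(iris[, 1:4]), center = ctr, scale = FALSE) %*% fit$scaling
auto <- predict(fit)$x

max(abs(manual - auto))  # essentially zero: each LD is a linear combination of the variables
```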
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col=classes)
lda.arrows(lda.fit, myscale = 1.4)
In the above biplot, the index of accessibility to radial highways (rad) has by far the longest arrow, i.e. it is the variable that contributes most to the first linear discriminant and thus to separating the crime classes.
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 11 14 1 0
## med_low 5 17 0 0
## med_high 1 13 16 0
## high 0 0 0 24
Every time I run the code, the cross-tabulation values differ because of the random sampling into train and test data. However, for the cross tabulation shown, the model predicted about two thirds of the test observations correctly (68 of 102, roughly 67%), so the misclassification rate is approximately 33%.
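The quoted rates can be recomputed from the cross tabulation (values copied from the run shown above; a new run will give slightly different numbers):

```r
# Cross tabulation copied from the output above (rows = correct, columns = predicted)
classes4 <- c("low", "med_low", "med_high", "high")
conf <- matrix(c(11, 14,  1,  0,
                  5, 17,  0,  0,
                  1, 13, 16,  0,
                  0,  0,  0, 24),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct = classes4, predicted = classes4))

sum(diag(conf)) / sum(conf)  # about 0.667 correct, so about 33% misclassified for this split
```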
data('Boston')
boston_scaled2 <- scale(Boston)
dist_eu <- dist(boston_scaled2)
set.seed(12345)
k_max <- 13
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
#qplot(x = 1:k_max, y = twcss, geom =c("point", "line"),span = 0.2)
# k-means clustering
b <- 1:k_max
aa <- data.frame(b = b, twcss = twcss)
ggplot(data=aa, aes(x=b, y=twcss, group=1)) +
geom_line(color="red")+
geom_point() + ggtitle("Optimal Number of Clusters")

One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes. Using the elbow method, we look for the point where the substantial decrease in WCSS levels off: the optimal number of clusters is where the last radical drop in total WCSS is observed in the line plot. From the above line plot, the decrease in within-cluster dissimilarity is no longer radical after 2 clusters. Consequently, we can say with reasonable confidence that the optimal number of clusters is 2.
Assuming the above optimal number cluster is valid and the K-Means algorithm run again and the plot results are displayed below:
km <-kmeans(boston_scaled2, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled2, main = "K-means clustering", col = km$cluster, upper.panel = NULL)

Bonus
boston_scaled_bonus<-as.data.frame(scale(Boston))
set.seed(12345)
km_bonus<-kmeans(dist_eu, centers = 4)
myclust<-data.frame(km_bonus$cluster)
boston_scaled_bonus$clust<-km_bonus$cluster
lda.fit_bonus<-lda(clust~., data = boston_scaled_bonus)
#lda.fit_bonus
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=2.5)
}
# plot the lda results
plot(lda.fit_bonus, dimen = 2, col = boston_scaled_bonus$clust)
lda.arrows(lda.fit_bonus, myscale = 2.6)
In the biplot above, the proportion of non-retail business acres per town (indus) has the largest variance (that is, it accounts for as much of the variability in the Boston data as possible) when the number of clusters is 4.
#install.packages("FactoMineR")
library(GGally)
library(ggplot2)
library(tidyr)
library(dplyr)
library(corrplot)
library(ggpubr)
library(magrittr)
library(boot)
library(knitr)
library(kableExtra)
library(FactoMineR)
human <- read.table("data/human.csv")
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155 8
This week, I am dealing with the “human” data, which was generated by combining two datasets, “Human development” and “Gender inequality”. Both original datasets come from the United Nations Development Programme (UNDP); additional information about the data can be found in [1] and [2]. Before combining the two datasets, the “Gender inequality” data was mutated by creating two new variables. The first is the ratio of female and male populations with secondary education in each country (i.e. edu2F / edu2M). The second is the ratio of labour force participation of females and males in each country (i.e. labF / labM). Using the variable Country as the identifier, the two datasets were joined together. After combining, the data was further modified by excluding unneeded variables, removing all rows with missing values, and keeping only the observations that relate to countries. The final human dataset has 155 observations and 8 variables. All eight variables are continuous.
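The wrangling steps described above can be sketched as follows (toy stand-in data frames and illustrative column names; the actual wrangling script is linked above):

```r
library(dplyr)
# Toy stand-ins for the two UNDP datasets; values and names are illustrative.
hd  <- data.frame(Country = c("Norway", "Australia"), Life.Exp = c(81.6, 82.4))
gii <- data.frame(Country = c("Norway", "Australia"),
                  Edu2.F = c(97.4, 94.3), Edu2.M = c(96.7, 94.6),
                  Labo.F = c(61.2, 58.8), Labo.M = c(68.7, 71.8))
# Mutate the "Gender inequality" data: the two new ratio variables.
gii <- gii %>%
  mutate(Edu2.FM = Edu2.F / Edu2.M,   # female/male secondary education ratio
         Labo.FM = Labo.F / Labo.M)   # female/male labour participation ratio
# Join the two datasets using Country as the identifier.
human_sketch <- inner_join(hd, gii, by = "Country")
```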
human_summary <- do.call(cbind, lapply(human, summary))
kable(human_summary, "html", caption = "human data summary table") %>% kable_styling(bootstrap_options = "condensed", font_size = 14, full_width = F) %>% column_spec(1:1, bold = T, color = "white", background = "green") %>% row_spec(0, bold = T, color = "white", background = "green")
| | Edu2.FM | Labo.FM | Edu.Exp | Life.Exp | GNI | Mat.Mor | Ado.Birth | Parli.F |
|---|---|---|---|---|---|---|---|---|
| Min. | 0.1717172 | 0.1856946 | 5.40000 | 49.00000 | 581.0 | 1.0000 | 0.60000 | 0.00000 |
| 1st Qu. | 0.7264088 | 0.5984174 | 11.25000 | 66.30000 | 4197.5 | 11.5000 | 12.65000 | 12.40000 |
| Median | 0.9375000 | 0.7534669 | 13.50000 | 74.20000 | 12040.0 | 49.0000 | 33.60000 | 19.30000 |
| Mean | 0.8528640 | 0.7074314 | 13.17613 | 71.65355 | 17627.9 | 149.0839 | 47.15871 | 20.91161 |
| 3rd Qu. | 0.9968404 | 0.8535430 | 15.20000 | 77.25000 | 24512.0 | 190.0000 | 71.95000 | 27.95000 |
| Max. | 1.4967320 | 1.0380368 | 20.20000 | 83.50000 | 123124.0 | 1100.0000 | 204.80000 | 57.50000 |
Visualizing the scatter plot matrix and examining the correlation between the human data variables:
p1=ggpairs(human,title="scatter plot matrix with correlation")
p1 + theme(plot.title = element_text(size = rel(2)))
The scatter plot matrix above indicates that some of the variables are skewed: for example, GNI, Mat.Mor and Ado.Birth are positively skewed, whereas Life.Exp and Labo.FM are negatively skewed. The matrix also shows that Edu2.FM and Edu.Exp follow a roughly normal distribution.
cor_matrix<-cor(human) %>% round(digits=2)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "lower", cl.pos = "b", tl.col="black", tl.pos = "d", tl.cex = 1.2, title="Correlations plot", mar=c(0,0,1,0))
The correlation plot above shows the relationships between the variables as a colored representation of the correlation values. The values range from red to blue (-1 to 1), with white as the middle value, zero. For example, the adolescent birth rate (Ado.Birth) is positively correlated with the maternal mortality ratio, represented as a high-intensity blue circle (a correlation of 0.759 in the scatter plot matrix), but negatively correlated (-0.857) with life expectancy at birth (Life.Exp). Similarly, the ratio of females to males with secondary education (Edu2.FM) and the expected years of schooling (Edu.Exp) are both positively correlated with life expectancy at birth (Life.Exp). On the other hand, there is very little correlation between the ratio of females to males in the labour force (Labo.FM) and either Edu.Exp or GNI.
pca_human_no_standard <- prcomp(human)
pca_human_no_standard
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4
## Edu2.FM -5.607472e-06 0.0006713951 -3.412027e-05 -2.736326e-04
## Labo.FM 2.331945e-07 -0.0002819357 5.302884e-04 -4.692578e-03
## Edu.Exp -9.562910e-05 0.0075529759 1.427664e-02 -3.313505e-02
## Life.Exp -2.815823e-04 0.0283150248 1.294971e-02 -6.752684e-02
## GNI -9.999832e-01 -0.0057723054 -5.156742e-04 4.932889e-05
## Mat.Mor 5.655734e-03 -0.9916320120 1.260302e-01 -6.100534e-03
## Ado.Birth 1.233961e-03 -0.1255502723 -9.918113e-01 5.301595e-03
## Parli.F -5.526460e-05 0.0032317269 -7.398331e-03 -9.971232e-01
## PC5 PC6 PC7 PC8
## Edu2.FM -0.0022935252 2.180183e-02 6.998623e-01 7.139410e-01
## Labo.FM 0.0022190154 3.264423e-02 7.132267e-01 -7.001533e-01
## Edu.Exp 0.1431180282 9.882477e-01 -3.826887e-02 7.776451e-03
## Life.Exp 0.9865644425 -1.453515e-01 5.380452e-03 2.281723e-03
## GNI -0.0001135863 -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor 0.0266373214 1.695203e-03 1.355518e-04 8.371934e-04
## Ado.Birth 0.0188618600 1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F -0.0716401914 -2.309896e-02 -2.642548e-03 2.680113e-03
s_no_standard=summary(pca_human_no_standard)
pca_pr_no_standard <- round(100*s_no_standard$importance[2, ], digits = 1)
pc_lab_no_standard=paste0(names(pca_pr_no_standard), " (", pca_pr_no_standard, "%)")
biplot(pca_human_no_standard, cex = c(0.8, 1), col = c("black", "red"), xlab = pc_lab_no_standard[1], ylab = pc_lab_no_standard[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
pca_human_standard <- prcomp(human,scale. = TRUE)
pca_human_standard
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.35664370 0.03796058 -0.24223089 0.62678110 -0.5983585
## Labo.FM 0.05457785 0.72432726 -0.58428770 0.06199424 0.2625067
## Edu.Exp -0.42766720 0.13940571 -0.07340270 -0.07020294 0.1659678
## Life.Exp -0.44372240 -0.02530473 0.10991305 -0.05834819 0.1628935
## GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor 0.43697098 0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth 0.41126010 0.07708468 0.01968243 0.04986763 -0.4672068
## Parli.F -0.08438558 0.65136866 0.72506309 0.01396293 -0.1523699
## PC6 PC7 PC8
## Edu2.FM 0.17713316 0.05773644 0.16459453
## Labo.FM -0.03500707 -0.22729927 -0.07304568
## Edu.Exp -0.38606919 0.77962966 -0.05415984
## Life.Exp -0.42242796 -0.43406432 0.62737008
## GNI 0.11120305 -0.13711838 -0.16961173
## Mat.Mor 0.17370039 0.35380306 0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F 0.13749772 0.00568387 -0.02306476
s_standard=summary(pca_human_standard)
pca_pr_standard <- round(100*s_standard$importance[2, ], digits = 1)
pc_lab_standard=paste0(names(pca_pr_standard), " (", pca_pr_standard, "%)")
biplot(pca_human_standard, cex = c(0.8, 1), col = c("black", "red"), xlab = pc_lab_standard[1], ylab = pc_lab_standard[2])
There is a big difference between the results of PCA with and without standardizing the data. As explained in the course slides, “PCA is sensitive to the relative scaling of the original features and assumes that features with larger variance are more important than features with smaller variance”. Hence, performing PCA on unstandardized variables results in large loadings for the variables with the largest variance. For example, the summary table shows that among the eight variables, GNI has by far the largest variance. This makes the first principal component depend almost entirely on that one variable: PC1 is dominated by GNI and explains nearly all of the variance (99.99%).
When the variables are scaled, we get a much better representation of the data. The first two principal components together explain about 70% of the variance: the first principal component (PC1) explains 53.6% of the variance, compared to nearly 100% when the data were not scaled, and the second principal component (PC2) explains 16.2%.
From the standardized PCA biplot, we can see that four variables, namely Edu.Exp, Life.Exp, GNI and Edu2.FM, are positively correlated with each other, since the angles between their arrows are relatively small; of these, GNI and Edu2.FM have the highest correlation since the angle between them is smallest. Similarly, Parli.F and Labo.FM are positively correlated with each other, and so are Mat.Mor and Ado.Birth. Parli.F and Labo.FM are almost orthogonal to the other variables and thus show little correlation with them. Furthermore, the plot shows that Life.Exp and Ado.Birth are the least correlated, as they are farthest apart in the plot (notice the large angle between these two variables). From the biplot we can also see that Parli.F and Labo.FM are aligned with PC2 (i.e. they contribute to the direction of PC2), whereas the other variables are aligned with PC1 (i.e. they contribute to the direction of PC1).
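The scale sensitivity described above is easy to demonstrate on toy data (a sketch, not part of the human analysis): a single high-variance column dominates PC1 unless the data are standardized.

```r
# Two uncorrelated columns with very different scales.
set.seed(1)
toy <- data.frame(small = rnorm(100, sd = 1),
                  big   = rnorm(100, sd = 1000))
# Without scaling, PC1 is essentially the "big" column (proportion ~ 1).
round(summary(prcomp(toy))$importance[2, ], 3)
# With scaling, the variance splits roughly evenly (~0.5 each).
round(summary(prcomp(toy, scale. = TRUE))$importance[2, ], 3)
```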
The tea data we are dealing with in this week’s exercise comes from the R package “FactoMineR”. It is a data frame with 300 rows and 36 columns, collected in a survey of a sample of 300 tea consumers.
data("tea")
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
#g1<-gather(tea) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")
#g1+ geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming
## feminine :129 Not.sophisticated: 85 No.slimming:255
## Not.feminine:171 sophisticated :215 slimming : 45
##
##
##
##
##
## exciting relaxing effect.on.health
## exciting :116 No.relaxing:113 effect on health : 66
## No.exciting:184 relaxing :187 No.effect on health:234
##
##
##
##
##
Among the 36 columns (variables), I chose the following:
#select my interest of variables
keep_columns <- c("Tea", "home", "sugar", "always", "lunch", "slimming", "health") # "health" is not a column in tea (probably meant "healthy"), hence the warning below
#my interest of new dataset
tea_my <- dplyr::select(tea, one_of(keep_columns))
## Warning: Unknown variables: `health`
str(tea_my)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ slimming: Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
g1 <- gather(tea_my) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")
## Warning: attributes are not identical across measure variables;
## they will be dropped
g1 + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
summary(tea_my)
## Tea home sugar always
## black : 74 home :291 No.sugar:155 always :103
## Earl Grey:193 Not.home: 9 sugar :145 Not.always:197
## green : 33
## lunch slimming
## lunch : 44 No.slimming:255
## Not.lunch:256 slimming : 45
##
# multiple correspondence analysis
mca <- MCA(tea_my, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_my, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.224 0.190 0.175 0.165 0.149 0.140
## % of var. 19.213 16.293 14.997 14.152 12.759 11.962
## Cumulative % of var. 19.213 35.506 50.503 64.655 77.414 89.376
## Dim.7
## Variance 0.124
## % of var. 10.624
## Cumulative % of var. 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | -0.262 0.102 0.082 | -0.225 0.089 0.060 | 0.321 0.197
## 2 | -0.734 0.801 0.661 | -0.149 0.039 0.027 | 0.379 0.274
## 3 | -0.142 0.030 0.051 | -0.183 0.059 0.084 | 0.092 0.016
## 4 | 0.330 0.162 0.259 | -0.260 0.118 0.160 | 0.034 0.002
## 5 | 0.022 0.001 0.001 | 0.275 0.133 0.120 | -0.395 0.298
## 6 | -0.142 0.030 0.051 | -0.183 0.059 0.084 | 0.092 0.016
## 7 | -0.142 0.030 0.051 | -0.183 0.059 0.084 | 0.092 0.016
## 8 | -0.734 0.801 0.661 | -0.149 0.039 0.027 | 0.379 0.274
## 9 | -0.142 0.030 0.051 | -0.183 0.059 0.084 | 0.092 0.016
## 10 | -0.734 0.801 0.661 | -0.149 0.039 0.027 | 0.379 0.274
## cos2
## 1 0.123 |
## 2 0.176 |
## 3 0.021 |
## 4 0.003 |
## 5 0.248 |
## 6 0.021 |
## 7 0.021 |
## 8 0.176 |
## 9 0.021 |
## 10 0.176 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | -1.138 23.767 0.424 -11.263 | 0.235 1.190 0.018
## Earl Grey | 0.543 14.097 0.532 12.607 | 0.143 1.161 0.037
## green | -0.622 3.166 0.048 -3.782 | -1.365 17.969 0.230
## home | -0.084 0.515 0.231 -8.306 | 0.028 0.067 0.026
## Not.home | 2.731 16.640 0.231 8.306 | -0.908 2.170 0.026
## No.sugar | -0.648 16.143 0.449 -11.589 | 0.096 0.420 0.010
## sugar | 0.693 17.257 0.449 11.589 | -0.103 0.449 0.010
## always | 0.306 2.389 0.049 3.825 | 0.787 18.660 0.324
## Not.always | -0.160 1.249 0.049 -3.825 | -0.412 9.756 0.324
## lunch | 0.561 3.430 0.054 4.021 | 0.149 0.287 0.004
## v.test Dim.3 ctr cos2 v.test
## black 2.321 | 0.714 11.963 0.167 7.060 |
## Earl Grey 3.331 | -0.008 0.004 0.000 -0.179 |
## green -8.298 | -1.555 25.336 0.299 -9.453 |
## home 2.762 | -0.084 0.655 0.229 -8.276 |
## Not.home -2.762 | 2.721 21.165 0.229 8.276 |
## No.sugar 1.721 | 0.070 0.243 0.005 1.256 |
## sugar -1.721 | -0.075 0.260 0.005 -1.256 |
## always 9.844 | -0.803 21.099 0.337 -10.043 |
## Not.always -9.844 | 0.420 11.032 0.337 10.043 |
## lunch 1.071 | 0.612 5.227 0.064 4.385 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.552 0.232 0.392 |
## home | 0.231 0.026 0.229 |
## sugar | 0.449 0.010 0.005 |
## always | 0.049 0.324 0.337 |
## lunch | 0.054 0.004 0.064 |
## slimming | 0.010 0.545 0.022 |
About 50.5% of the variation is captured by the first three dimensions: the first captures 19.21% of the variance, the second 16.29% and the third 15.00%. The categorical variables table shows the squared correlations (eta2) between the variables and the dimensions. The type of tea has a strong squared correlation with the first dimension (0.552) and modest squared correlations with dimensions two (0.232) and three (0.392). Slimming has a strong squared correlation with dimension two (0.545) and only small squared correlations with dimensions one (0.010) and three (0.022).
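These percentages can also be pulled directly from the fitted model object; in FactoMineR, mca$eig holds the eigenvalues together with the percentage and cumulative percentage of variance (a small sketch reusing the mca object fitted above):

```r
# First three rows of the eigenvalue table: eigenvalue, % of variance,
# cumulative % of variance per dimension.
head(mca$eig, 3)
# Variance captured by the first three dimensions (column 2 is % of variance).
sum(mca$eig[1:3, 2])  # about 50.5 for this model
```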
# visualize MCA
plot(mca, invisible=c("ind"),habillage = "quali")
The MCA biplot shows how similar the variable categories are to each other: categories plotted close together tend to occur together. For example, the type of tea most associated with ‘sugar’ is ‘Earl Grey’, drunk after ‘lunch’.